8 research outputs found
Automatic parallel implementations of adjoint codes for structured mesh applications
Algorithmic Differentiation (AD) has been shown to be an essential tool for obtaining sensitivity information in multiple areas of science, such as Computational Fluid Dynamics (CFD) applications or finance. Yet there is no adequate tool to reduce the cost of producing performance-portable AD codes, especially for modern hardware such as GPU clusters. This paper sketches our plans and progress so far towards extending the OPS framework with an adjoint tape (storage for descriptors of intermediate steps and intermediate states of variables), and shows preliminary performance results on CPU nodes. OPS (the Oxford Parallel library for Structured mesh solvers) has shown good performance and scaling on a wide range of HPC architectures. Our work aims to exploit the benefits of OPS to provide performance-portable adjoint implementations for future structured mesh stencil applications, requiring only minimal modifications to code that uses OPS.
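The adjoint-tape idea can be illustrated with a minimal reverse-mode sketch (the names and structure here are ours, not the OPS API): the forward sweep records each operation's output, inputs, and local partial derivatives on a tape, and the reverse sweep replays the tape backwards, accumulating adjoints.

```python
# Minimal reverse-mode AD tape (illustrative only; not the OPS design).
# Forward pass records (output id, [(input id, local partial), ...]);
# the reverse sweep walks the tape backwards, accumulating adjoints.

class Tape:
    def __init__(self):
        self.entries = []    # recorded operations, in execution order
        self.adjoints = {}

    def record(self, out_id, partials):
        self.entries.append((out_id, partials))

    def reverse(self, seed_id):
        self.adjoints = {seed_id: 1.0}
        for out_id, partials in reversed(self.entries):
            bar = self.adjoints.get(out_id, 0.0)
            for in_id, d in partials:
                self.adjoints[in_id] = self.adjoints.get(in_id, 0.0) + bar * d

tape = Tape()

def mul(a_id, a, b_id, b, out_id):
    tape.record(out_id, [(a_id, b), (b_id, a)])   # d(ab)/da = b, d(ab)/db = a
    return a * b

def add(a_id, a, b_id, b, out_id):
    tape.record(out_id, [(a_id, 1.0), (b_id, 1.0)])
    return a + b

# f(x, y) = x*y + x  =>  df/dx = y + 1, df/dy = x
x, y = 3.0, 4.0
t = mul("x", x, "y", y, "t")
f = add("t", t, "x", x, "f")
tape.reverse("f")
# tape.adjoints["x"] == 5.0, tape.adjoints["y"] == 3.0
```

In an OPS-like setting the tape entries would be loop descriptors and checkpointed dataset states rather than scalar operations, but the record-then-reverse structure is the same.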
Bitwise Reproducible task execution on unstructured mesh applications
Many mesh applications use floating-point arithmetic, which does not necessarily obey the associative laws of algebra. This can make an application non-reproducible between runs. In this paper we present a method for unstructured mesh applications that provides bitwise reproducibility between separate runs, even when they are started with different numbers of MPI processes. We implement our work in the OP2 domain-specific library, which provides an API that abstracts the solution of unstructured mesh computations. We carry out a performance analysis of our method applied to two applications: a simple Airfoil application, and a more complex Aero application that uses a finite element method and a conjugate gradient algorithm. We show a 1.49× to 2.37× slowdown on these applications as the price of full bitwise reproducibility.
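The core difficulty can be shown with a tiny sketch (ours, not the OP2 implementation): because floating-point addition is not associative, the summation order must be made canonical. One way is to tag every contribution with its global element index and always reduce in global-index order, so the result is bitwise identical for any partitioning.

```python
# Illustrative fixed-order reduction (not the OP2 implementation):
# contributions are tagged with global element indices and summed in a
# canonical global order, independent of how elements were distributed
# across MPI ranks.

def reproducible_sum(tagged_partials):
    # tagged_partials: list of (global_index, value) gathered from all ranks
    total = 0.0
    for _, v in sorted(tagged_partials):   # canonical global order
        total += v
    return total

vals = [0.1 * i for i in range(100)]

# Two different "MPI decompositions" of the same 100 elements:
two_ranks  = ([(i, vals[i]) for i in range(50)]
              + [(i, vals[i]) for i in range(50, 100)])
four_ranks = [(i, vals[i]) for r in range(4) for i in range(r, 100, 4)]

s2 = reproducible_sum(two_ranks)
s4 = reproducible_sum(four_ranks)
# s2 == s4, bit for bit; a naive sum in arrival order may differ.
```

A real implementation also has to make the per-element assembly order canonical (e.g. contributions from neighbouring cells), not just global reductions, but the principle is the same.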
Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS
The key common bottleneck in most stencil codes is data movement, and prior
research has shown that improving data locality through optimisations that
schedule across loops perform particularly well. However, in many large PDE
applications it is not possible to apply such optimisations through compilers
because there are many options, execution paths and data per grid point, many
dependent on run-time parameters, and the code is distributed across different
compilation units. In this paper, we adapt the data locality improving
optimisation called iteration space slicing for use in large OPS applications
both in shared-memory and distributed-memory systems, relying on run-time
analysis and delayed execution. We evaluate our approach on a number of
applications, observing speedups of 2× on the CloverLeaf 2D/3D proxy
applications, which contain 83/141 loops respectively, on the linear
solver TeaLeaf, and on the compressible Navier-Stokes solver
OpenSBLI. We demonstrate strong and weak scalability up to 4608 cores of
CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's
Knights Landing, demonstrating maintained throughput as the problem size grows
beyond 16GB, and we do scaling studies up to 8704 cores. The approach is
generally applicable to any stencil DSL that provides per loop data access
information.
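The slicing-plus-delayed-execution idea can be sketched in miniature (names are ours, not the OPS runtime API): loops are queued together with their stencil halos, and at a synchronisation point each tile of the final loop is traced backwards, so earlier loops execute only over the slice that the tile depends on.

```python
# Illustrative iteration-space slicing with delayed execution (not the OPS
# runtime). Queued loops record their stencil halos; each tile of the last
# loop is traced backwards, widening the slice by each consumer's halo, so
# all loops in the chain run tile-by-tile while data stays in cache.

N, TILE = 32, 8

a = [float(i * i % 7) for i in range(N)]
b = [0.0] * N
c = [0.0] * N

def loop_A(i):   # b = 3-point sum of a (indices clamped at the boundary)
    b[i] = a[max(0, i - 1)] + a[i] + a[min(N - 1, i + 1)]

def loop_B(i):   # c = 3-point sum of b
    c[i] = b[max(0, i - 1)] + b[i] + b[min(N - 1, i + 1)]

def run_tiled(loop_queue, n, tile):
    # loop_queue: ordered (kernel, halo) pairs recorded by delayed execution
    for start in range(0, n, tile):
        lo, hi = start, min(start + tile, n)
        # Walk the queue backwards, widening the needed slice by each halo.
        slices = []
        for kernel, halo in reversed(loop_queue):
            slices.append((kernel, lo, hi))
            lo, hi = max(0, lo - halo), min(n, hi + halo)
        # Execute the loops in their original order, over their slices only.
        for kernel, s_lo, s_hi in reversed(slices):
            for i in range(s_lo, s_hi):
                kernel(i)

run_tiled([(loop_A, 1), (loop_B, 1)], N, TILE)
```

This sketch trades a little redundant computation on tile overlaps for locality; the per-loop access information that the last sentence of the abstract refers to is exactly the `halo` recorded with each queued loop.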
An abstraction for local computations on structured meshes and its extension to handling multiple materials
Computations involving a neighbourhood on structured meshes represent a wide class of applications that includes the simulation of cellular automata and the solution of partial differential equations (PDEs). In this paper we present an abstraction for describing such computations at a high level, allowing fast experimentation and high productivity. The abstraction is designed so that it can be automatically converted to various high-performance implementations. A critical feature of this abstraction is an extension to support a varying number of materials, or species, at each grid point, enabling much more complex simulations.
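A minimal sketch of such an abstraction (the names are ours, not the actual library API): the user writes only a per-point kernel over its neighbourhood, and the framework owns the loop, so it can later emit tiled, MPI, or GPU implementations from the same description. The multi-material extension appears as a ragged per-point list of species.

```python
# Illustrative neighbourhood abstraction (names are ours, not the library
# API). The framework owns the mesh loop; the user kernel sees only a
# local neighbourhood, which keeps it retargetable to other backends.

N = 8
u = [0.0] * N
u[N // 2] = 4.0
u_new = [0.0] * N

def par_loop(kernel, lo, hi, *args):
    for i in range(lo, hi):        # framework-owned loop over the mesh
        kernel(i, *args)

def diffuse(i, u, u_new):          # user kernel: a 3-point neighbourhood
    u_new[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]

par_loop(diffuse, 1, N - 1, u, u_new)

# Multi-material extension: a ragged per-point list of (material, volume
# fraction) pairs, so the number of species varies from point to point.
materials = [[(0, 1.0)], [(0, 0.5), (1, 0.5)], [(1, 1.0)]]
rho_of = {0: 2.0, 1: 8.0}          # per-material densities (made-up values)
rho = [0.0] * len(materials)

def mix_density(i, mats, rho_out):
    rho_out[i] = sum(frac * rho_of[m] for m, frac in mats[i])

par_loop(mix_density, 0, len(materials), materials, rho)
```

Because the kernel never indexes the ragged material list directly outside the point it is given, the framework remains free to choose a cell-centric or material-centric storage layout underneath.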
Beyond 16GB: Out-of-Core Stencil Computations
Stencil computations are a key class of applications, widely used in the
scientific computing community, and a class that has particularly benefited
from performance improvements on architectures with high memory bandwidth.
Unfortunately, such architectures come with a limited amount of fast memory,
which limits the size of the problems that can be efficiently solved. In
this paper, we address this challenge by applying the well-known cache-blocking
tiling technique to large scale stencil codes implemented using the OPS domain
specific language, such as CloverLeaf 2D, CloverLeaf 3D, and OpenSBLI. We
introduce a number of techniques and optimisations to help manage data resident
in fast memory, and minimise data movement. Evaluating our work on Intel's
Knights Landing Platform as well as NVIDIA P100 GPUs, we demonstrate that it is
possible to solve problems 3 times larger than the on-chip memory size with at
most a 15% loss in efficiency.
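The staging pattern behind such out-of-core execution can be sketched in one dimension (ours, not the OPS implementation): the grid lives in large, slow memory; each tile plus a one-point halo is copied into a small fast buffer, the stencil runs entirely in the buffer, and the tile interior is written back, so the fast-memory footprint is proportional to the tile, not the grid.

```python
# Illustrative out-of-core staging sketch (not the OPS implementation).
# The full grid stays in slow memory; only tile + halo is ever resident
# in the "fast" buffer, mimicking MCDRAM on KNL or HBM on a GPU.

def jacobi_out_of_core(u, tile):
    n = len(u)
    out = u[:]                            # result stays in slow memory
    for start in range(1, n - 1, tile):
        end = min(start + tile, n - 1)
        fast = u[start - 1:end + 1]       # stage tile + halo into fast memory
        for i in range(1, len(fast) - 1): # compute entirely in the buffer
            out[start + i - 1] = (fast[i - 1] + fast[i] + fast[i + 1]) / 3.0
    return out

grid = [float((3 * i) % 5) for i in range(20)]
tiled = jacobi_out_of_core(grid, 4)
```

The efficiency loss reported in the abstract comes from exactly the transfers this sketch makes explicit: the tile staging in and out of fast memory, which the paper's techniques aim to overlap and minimise.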